feat: make ICU the default FTS tokenizer by Xuanwo · Pull Request #6968 · lance-format/lance

Xuanwo · 2026-05-27T19:46:33Z

This changes the default native FTS tokenizer from simple to icu, so new inverted indexes handle mixed-language text without requiring users to opt into multilingual tokenization. Legacy indexes with missing tokenizer metadata continue to resolve to simple, and builds without the ICU feature still fall back to simple.
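The resolution rules in this description can be sketched as below. This is a minimal illustration, not the code in this PR: the function names `tokenizer_for_new_index` and `tokenizer_for_existing_index` are hypothetical, the signature given to `default_base_tokenizer` is invented for the sketch, and the `icu_enabled` argument is an assumption standing in for the compile-time ICU feature.

```python
from typing import Optional

SIMPLE = "simple"
ICU = "icu"


def default_base_tokenizer(icu_enabled: bool) -> str:
    """Default for newly created indexes (hypothetical signature).

    `icu_enabled` models whether the build includes the ICU feature.
    """
    return ICU if icu_enabled else SIMPLE


def tokenizer_for_new_index(requested: Optional[str], icu_enabled: bool) -> str:
    """An explicit user choice wins; otherwise use the build's default."""
    if requested is not None:
        return requested
    return default_base_tokenizer(icu_enabled)


def tokenizer_for_existing_index(stored: Optional[str]) -> str:
    """Indexes written before the change may lack tokenizer metadata.

    Missing metadata keeps resolving to simple, regardless of the new default,
    so old indexes are read with the tokenizer they were built with.
    """
    return stored if stored is not None else SIMPLE
```

The point of keeping the two paths separate is that changing the default for new indexes never changes how an existing index is interpreted.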

Benchmark summary from the 100M-row runs:

Dataset	Retrieval impact	Build cost	Index size	Query latency
English-only	Recall unchanged	+15.4%	+0.6%	Common terms flat; rare terms slightly slower but still small
Mixed-language	ZH / JP / TH rare recall improves from 0.0 to 1.0	+20.4%	+25.7%	EN / FR flat; multilingual rare queries roughly flat

The key tradeoff is that ICU adds modest overhead for English-only data, but it fixes default recall for unspaced Chinese, Japanese, and Thai text.
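A small sketch of why a simple tokenizer loses recall on unspaced text. The `simple_tokenize` function here is an approximation written for illustration (lowercased runs of word characters), not Lance's actual simple tokenizer, and the sample sentences are made up.

```python
import re
from typing import List


def simple_tokenize(text: str) -> List[str]:
    # Approximation: split on anything that is not a word character, then lowercase.
    # CJK ideographs count as word characters, so an unspaced sentence stays in one piece.
    return [token.lower() for token in re.findall(r"\w+", text)]


def term_matches(term: str, document: str) -> bool:
    # Exact-token lookup, mimicking a term query against an inverted index.
    return term.lower() in simple_tokenize(document)


english_doc = "Full text search works"
chinese_doc = "我喜欢全文搜索"  # "I like full-text search", written without spaces

english_tokens = simple_tokenize(english_doc)  # four separate word tokens
chinese_tokens = simple_tokenize(chinese_doc)  # a single token spanning the whole sentence

found_english = term_matches("search", english_doc)  # True
found_chinese = term_matches("搜索", chinese_doc)     # False: "搜索" is never its own token
```

Because the whole Chinese sentence is indexed as one token, a query for a word inside it finds nothing, which is consistent with the 0.0 rare recall reported for simple. A word-boundary segmenter such as ICU's is expected to emit shorter word tokens that such queries can match.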

Detailed benchmark numbers

English-only 100M rows

Metric	simple	icu	Difference
Build time	15.31s	17.66s	+15.4%
Index size delta	964.0MB	969.7MB	+0.6%
Common-term latency	2370.1ms / 1179.4ms	2372.5ms / 1188.8ms	Flat
Rare-term recall	1.0 / 1.0	1.0 / 1.0	No change
Rare-term latency	11.7ms / 14.2ms	23.3ms / 17.7ms	Slightly slower

Mixed-language 100M rows

Metric	simple	icu	Difference
Build time	33.35s	40.16s	+20.4%
Index size delta	800.1MB	1005.7MB	+25.7%
EN common latency	1026.7ms	953.8ms	Slightly faster
FR common latency	945.9ms	946.1ms	Flat
CJK common query	0 rows / 1.1ms	10 rows / 944.1ms	`simple` misses
ZH / JP / TH rare recall	0.0 / 0.0 / 0.0	1.0 / 1.0 / 1.0	ICU recovers recall
FR / EN rare recall	1.0 / 1.0	1.0 / 1.0	No change
Rare-query latency	13.7-24.6ms	11.1-23.1ms	Roughly flat
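The Difference column can be rechecked from the rounded values in the tables above. This is only an arithmetic sketch; the `pct_change` helper and the `measurements` dictionary are names introduced here for illustration.

```python
def pct_change(baseline: float, new: float) -> float:
    """Relative change of `new` versus `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


# (simple, icu) pairs copied from the tables above.
measurements = {
    "english_build_s": (15.31, 17.66),
    "english_index_mb": (964.0, 969.7),
    "mixed_build_s": (33.35, 40.16),
    "mixed_index_mb": (800.1, 1005.7),
}

# Percent overhead of icu over simple, rounded to one decimal place.
overheads = {
    name: round(pct_change(simple, icu), 1)
    for name, (simple, icu) in measurements.items()
}
```

Recomputed this way, the mixed-language figures come out at 20.4 and 25.7 and the English-only index size at 0.6, matching the tables. The English-only build time comes out at about 15.3 rather than the reported +15.4%, presumably because the reported figure was computed from unrounded timings.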


codecov · 2026-05-28T07:05:58Z

Codecov Report

❌ Patch coverage is 92.30769% with 2 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
rust/lance-index/src/scalar/inverted/index.rs	0.00%	1 Missing ⚠️
rust/lance-index/src/scalar/inverted/tokenizer.rs	94.11%	1 Missing ⚠️


westonpace

Seems like good rationale. The only thing that makes me slightly hesitant is having default_base_tokenizer depend on a feature flag since it might be confusing.

Still, it's in the default feature list, and it's always going to be on for wheels / pylance, so I guess the vast majority of users will never be clearing this flag.

Xuanwo · 2026-05-28T15:29:08Z

> The only thing that makes me slightly hesitant is having default_base_tokenizer depend on a feature flag since it might be confusing.

I'm open to just removing this feature flag and making ICU always enabled.

feat: make ICU the default FTS tokenizer

a3fb622

github-actions Bot added the enhancement (New feature or request) and python labels May 27, 2026

Xuanwo added 2 commits May 28, 2026 14:18

test: fix FTS tokenizer CI assumptions

d0894db

test: keep FTS fixtures tokenizer-neutral

8014c72

Xuanwo marked this pull request as ready for review May 28, 2026 06:37

claude Bot reviewed May 28, 2026


test: keep FTS compat index legacy-readable

5ef5c09

westonpace approved these changes May 28, 2026


Xuanwo wants to merge 4 commits into main from xuanwo/icu-default-fts-tokenizer